ABSTRACT

Researchers increasingly recognize that significance tests are 
limited in their ability to inform scientific practice. Common 
errors in interpreting significance tests, and three strategies 
for augmenting the interpretation of significance test results, 
are illustrated. The first strategy for augmenting the 
interpretation of significance tests involves evaluating 
significance test results in a sample size context. A second 
strategy involves interpretation of effect size estimates; 
several estimates and corrections are discussed. A third
strategy emphasizes interpretation based on the estimated likelihood
that results will replicate. The bootstrap methods of Efron and
cross-validation strategies are illustrated.

statistically significant results are obtained, but
thoughtful interpretation of the results suggests that observed
effects are not noteworthy. This situation occurs with increasing
frequency as researchers become more aware that significance
tests are limited in their potential to inform valid
interpretations in scientific inquiry (Carver, 1978). Selecting
the best indices to use when evaluating empirical results has
become a subject of much debate (Huberty, 1987; Thompson, 1989a,
1989b; Rosnow & Rosenthal, 1989; Welge-Crow, LeCluyse &
Thompson, 1990).

Shaver (1979, pp. 5-6) has argued that

     The emphasis on statistics and the "test of
     significance" procedure has resulted in a
     methodological orientation toward establishing
     generalizability that has been deleterious in its
     effects on the scientific accumulation of knowledge
     in education.

Similarly, Carver (1978, p. 378) states that, "Statistical 
significance testing has involved more fantasy than fact." 

The traditional emphasis on significance testing and the 
direct interpretation of inferential test results has led to the 
need for improved editorial policies and scholarly practices 
(Thompson, 1987). It has been increasingly recognized that 
significance testing only directly aids the interpretation of the
results in special cases, e.g., significant results from small
sample sets, and not in all cases (Thompson, 1987). Thus,
significance testing is only useful when employed by thoughtful
researchers, and especially by researchers who are aware of
strategies that make the interpretation of the test results most
meaningful. The purpose of the present paper is to discuss three 
strategies useful in augmenting the interpretation of 
significance test results. First, however, a review of the 
influence of sample size on the outcomes of significance testing 
warrants some consideration. 

The Influence of Sample Size on Significance Test Results

The size of the sample used in a significance test has 
direct bearing on the results of the test. Selecting an 
appropriate sample size is one of the most difficult problems
found in designing research (McNamara, 1990a). When working with 
large samples, virtually all null hypotheses will be rejected, 
since the "null hypothesis of no difference is almost never 
exactly true in the population" (Thompson, 1987, p. 14). Stated 
a different way, Hays (1981, p. 293) argues that "virtually any
study can be made to show significant results if one uses enough 
subjects." Welge-Crow, LeCluyse and Thompson (1990) demonstrate 
Hays' argument using a concrete heuristic example that will serve 
here to emphasize the influence that sample size has on 
significance test results. 

Presume that a researcher decided to compare the mean IQ
scores of 12,000 students located in one zip code of the Houston
Independent School District with the mean IQ of the remaining
188,000 students residing in other zip codes. If the mean IQ of
the 12,000 students in the zip code of interest is only 100.15
(SD=15), and the mean of the remaining 188,000 students is 99.85
(SD=15), the two means differ to a statistically significant
degree (Zcalc = 2.12 > Zcrit = 1.96, p<.05). The less thoughtful
interpretation of these statistically significant findings would
be that the 12,000 students differ appreciably in their
intellects from their 188,000 peers. Perhaps it will even be
suggested that a school for the gifted be built in this zip code.
Yet, these differences can hardly be considered noteworthy.

A thoughtful researcher would note, in such a situation,
that the standardized difference in these two means (.3/15 = 0.02)
is trivially small. The researcher would also be aware that all
measurements are limited and imperfect, and that the difference
of the means (0.3, about one-third of one IQ point) is
substantially less than the standard error of measurement (SEM)
of an IQ measure with a reliability coefficient of 0.92 (SEM = 15
(1-.92)^.5 = 15 (.08)^.5 = 15 (.2828) = 4.243). The researcher
who applies the significance test and interprets only the
inferential results (as against the researcher who truly
understands the interpretation and realizes that there is not a
real difference between the samples) is less likely to offer
helpful recommendations for policy or worthy contributions to
theory elaboration.
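
The arithmetic in this heuristic example can be checked with a few lines of Python. The sketch below assumes a two-sample z test with a known common SD of 15; the function names are illustrative and do not come from the cited papers.

```python
import math


def z_for_mean_difference(mean1, mean2, sd, n1, n2):
    """z statistic for two independent means that share a known SD."""
    standard_error = sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    return (mean1 - mean2) / standard_error


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * (1 - reliability)^.5"""
    return sd * math.sqrt(1.0 - reliability)


# Values taken from the zip code example in the text.
z_calc = z_for_mean_difference(100.15, 99.85, 15.0, 12_000, 188_000)
standardized_difference = (100.15 - 99.85) / 15.0
sem = standard_error_of_measurement(15.0, 0.92)

print(f"Zcalc = {z_calc:.2f}")
print(f"standardized difference = {standardized_difference:.2f}")
print(f"SEM = {sem:.3f}")
```

The z statistic exceeds 1.96 only because the standard error shrinks with roughly 200,000 cases; the 0.3 point difference itself is far smaller than the SEM.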

In choosing sample size, Hinkle, Wiersma, and Jurs (1988, p. 
293) state three relevant assumptions: 

1. It is sensible to view research findings based 
on large samples as more reliable than findings 
based upon smaller samples, all other things 
being equal. 

2. However, inferential statistical methods will
not result in rejecting the null hypothesis if,
in the design of the study, an inappropriately
small sample was selected.

3. In a well-planned research study, in which the
varianca of the criterion variable is likely to 
be quite large and the treatment effects rather 
small, large samples are appropriate and 
justifiable. 

Thus, sample size is of the utmost importance for the 
interpretation of significance testing results. Morrison and 
Henkel (1970) and Carver (1978) explain the limits of
significance testing as an aid to scientific practice. Only once 
a sample is determined to be large enough to detect certain 
effect sizes can interpretation of significance test results be 
made more directly. In short, sample size must be considered when 
making interpretations of the empirical results. Otherwise, 
testing significance becomes only a test of whether a large 
sample is in hand, which the researcher presumably already knows, 
before the significance test is even conducted. 

1. Evaluating Significance Test Results in a Sample Size Context

The first strategy for augmenting interpretation of 
significance tests is the strategy of evaluating the expected or 
the obtained effect size in relation to changes in sample size 
(Thompson, 1989a). Prospectively, before data are collected, the
researcher must determine what sample size is needed to obtain a
statistically significant result for a given effect size.
Retrospectively, once data have been collected, the researcher
must determine how variations in sample size might have altered
the significance decision, assuming that the effect size is
generalizable and thus is taken as fixed.

Welge-Crow, LeCluyse and Thompson (1990) illustrate this
approach with the following application. Say a researcher
detected a large effect size of 33.6%. Table 1 presents
significance tests associated with this effect size taken as a
fixed value, but assuming different sample sizes had been used.
The table can be viewed as presenting results for either a
multiple regression analysis involving two predictor variables
(in which case the "r sq" effect size would be called the squared
multiple correlation coefficient, R^2) or an analysis of variance
involving an omnibus test of differences in three means in a
one-way design (in which case the "r sq" effect size would be called
the correlation ratio or eta^2).

INSERT TABLE 1 ABOUT HERE. 

The table presents results for a fixed effect size but
increasing sample sizes (n=4, 13, 23, or 33). For the effect
size (33.6%) reported in the table, the result becomes
statistically significant when there are somewhere between 13 and
23 subjects in the analysis (Welge-Crow, LeCluyse & Thompson,
1990, p. 5).
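
The pattern in Table 1 can be reproduced with a short calculation. This is a minimal sketch, assuming the usual F ratio for an r sq effect with k explained degrees of freedom; the function name is illustrative, and the critical values are entered by hand from a standard .05 F table rather than computed.

```python
def f_for_fixed_effect(r_sq, n, k):
    """F ratio for an r-squared effect with k explained df and n cases."""
    df_explained = k
    df_unexplained = n - k - 1
    f_calc = (r_sq / df_explained) / ((1.0 - r_sq) / df_unexplained)
    return f_calc, df_explained, df_unexplained


# Explained and total sums of squares reported in Table 1 (r sq is about .336).
R_SQ = 337.2 / 1002.3

# Standard tabled .05 critical values of F for (2, df_unexplained).
F_CRIT_05 = {1: 199.50, 10: 4.10, 20: 3.49, 30: 3.32}

results = []
for n in (4, 13, 23, 33):
    f_calc, df1, df2 = f_for_fixed_effect(R_SQ, n, k=2)
    decision = "Rej" if f_calc > F_CRIT_05[df2] else "Not Rej"
    results.append((n, round(f_calc, 3), decision))
    print(f"n={n:2d}  df=({df1},{df2:2d})  Fcalc={f_calc:6.3f}  "
          f"Fcrit={F_CRIT_05[df2]:6.2f}  {decision}")
```

Holding the effect size constant, the decision flips from "Not Rej" to "Rej" purely as a function of the number of cases.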

Craig, Eison and Metze (1976) have found serious distortions 
in interpretations of research studies when researchers failed to 
understand the effect that sample size has upon significance 
tests. When there is a failure to understand how the sample size 
affects results, researchers may ignore large effect sizes 
involving nonsignificant results attributable to small sample
sizes, while at the same time over-interpreting significant
results when the effect size is actually small (Welge-Crow,
LeCluyse & Thompson, 1990).

2. Interpretation of Effect Size Estimates

Effect size can be characterized as the "degree to which the
phenomenon exists" (Cohen, 1977, p. 9). There are numerous
methods that a researcher may choose with which to estimate the
effect size for the data (e.g., Hays, 1981; Tatsuoka, 1973). The
effect size is used by the researcher to "garner some insight
regarding result importance" (Welge-Crow, LeCluyse & Thompson,
1990).

McNamara (1990b) demonstrates the usefulness of the effect 
size in allowing the researcher to infer if a meaningful true 
difference occurs within the targeted population. In the 
example that McNamara (1990b) uses, he compares the difference 
between two means from a survey administered to a sample of 
teachers and a sample of administrators. The mean difference for 
a particular item was 0.34, which when divided by the common 
standard deviation of 0.66, yielded an effect size of 0.52. With 
an effect size of 0.52, the researcher can then conclude that "on
average the questionnaire item score for administrators was a
0.52 standard deviation higher than the same questionnaire item
score for teachers" (McNamara, 1990b, p. 29). Thus, the effect
size exceeds the one-half common standard deviation
that Borg (1987) argues represents a meaningful difference
between two means.
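
The effect size in this example is a simple ratio, sketched below. The function name and the 0.5 threshold variable are illustrative; the threshold restates the one-half standard deviation criterion attributed to Borg (1987) in the text.

```python
def standardized_mean_difference(mean_difference, common_sd):
    """Effect size expressed in common standard deviation units."""
    return mean_difference / common_sd


HALF_SD_CRITERION = 0.5  # one-half common standard deviation

effect_size = standardized_mean_difference(0.34, 0.66)
exceeds_half_sd = effect_size > HALF_SD_CRITERION

print(f"effect size = {effect_size:.2f}")
print(f"exceeds one-half SD criterion: {exceeds_half_sd}")
```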

However, Welge-Crow, LeCluyse, and Thompson (1990) explain
that sample effect sizes "overestimate" the effect size actually
found in the full population, as well as the effect size that is
likely to be found in future studies with different samples.
This inflation occurs because all classical parametric methods
(e.g., t-tests, ANOVA) are correlational methods (Knapp, 1978;
Thompson, 1988) that capitalize upon the sampling error as a part
of the least squares analysis (Welge-Crow, LeCluyse & Thompson,
1990). However, there are correction formulas available
(Maxwell, Camp & Arvey, 1981; Rosnow & Rosenthal, 1988) that can
be applied to estimate the population effect size from
sample results, or to estimate the effect sizes likely to
be found in future samples (Welge-Crow, LeCluyse & Thompson,
1990). These corrections tend to be larger when the sample sizes
are small or if the original effect size is small (Thompson,
1990).
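
The behavior of such corrections can be illustrated with one familiar formula. The sketch below assumes the common adjusted R-squared correction, 1 - (1 - R^2)(n - 1)/(n - k - 1); this is offered only as an example of a shrinkage correction and is not necessarily the formula recommended in the sources cited above. The effect size of .10 is a made-up comparison value.

```python
def adjusted_r_squared(r_sq, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), for k predictors."""
    return 1.0 - (1.0 - r_sq) * (n - 1) / (n - k - 1)


def shrinkage(r_sq, n, k):
    """Amount by which the correction reduces the sample effect size."""
    return r_sq - adjusted_r_squared(r_sq, n, k)


# Sample effect size and sample sizes borrowed from the Table 1 example.
for n in (13, 23, 33):
    print(f"n={n}: adjusted = {adjusted_r_squared(0.336, n, 2):.3f}, "
          f"shrinkage = {shrinkage(0.336, n, 2):.3f}")

# Hypothetical smaller effect (.10) at the same n, for comparison only.
print(f"n=13, r sq=.10: shrinkage = {shrinkage(0.10, 13, 2):.3f}")
```

Under this formula the correction shrinks as n grows and grows as the original effect size gets smaller, which matches the tendency described in the text.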

3. Evaluation of Result Replicability

Replication of the results of a study, one of the eight
elements of the scientific method (Babbie, 1990), is a third
strategy that can be used to facilitate accurate interpretation
of results. Increasing the estimated likelihood that the results
will replicate is one of the goals of research and plays a
crucial part in scholarly inquiry. As Welge-Crow, LeCluyse and
Thompson (1990) state, contrary to many misconceptions,
"significance tests do not evaluate the probability that results
will replicate" (p. 7).

One of the easiest and most commonly used methods with which
to scrutinize the replicability of results is that of
partitioning the sample and replicating the study on the other
portion. A comparison of results will then either support or
bring into question the replicability of the study's results and
conclusions. Various sample partitioning methods have been
devised, including conventional cross-validation strategies and
the "bootstrap" methods developed by Efron and his colleagues
(Diaconis & Efron, 1983; Efron, 1979).

One of the easiest cross-validation strategies to apply
is a three-step process. Initially the researcher
randomly splits the sample into two separate subgroups. Next, 
the researcher conducts the same analyses separately on both 
subgroups. Finally, the researcher empirically compares the 
results, attempting to demonstrate the replication of results for 
both subgroups, thus increasing confidence in the replicability 
of the study's results. 
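
The three steps can be sketched as follows. This is a simplified illustration, not the procedure of the cited authors: it assumes the analysis of interest is a single Pearson correlation (SYSTOLAV with TOTCHOL, using the Table 2 data), and it only places the two subgroup results side by side rather than computing a formal empirical index of agreement. The function names and the seed are arbitrary.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_results(x, y, seed=0):
    """Step 1: random split. Step 2: same analysis in each subgroup."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    rng.shuffle(indices)
    half = len(indices) // 2
    group_a, group_b = sorted(indices[:half]), sorted(indices[half:])
    r_a = pearson_r([x[i] for i in group_a], [y[i] for i in group_a])
    r_b = pearson_r([x[i] for i in group_b], [y[i] for i in group_b])
    return group_a, group_b, r_a, r_b


# SYSTOLAV and TOTCHOL columns from Table 2.
systolav = [94.0, 108.7, 97.7, 90.3, 100.7, 104.3,
            95.3, 97.7, 102.7, 96.0, 107.0, 108.0]
totchol = [180, 142, 165, 199, 187, 148, 164, 174, 190, 161, 159, 159]

group_a, group_b, r_a, r_b = split_half_results(systolav, totchol, seed=1)
r_full = pearson_r(systolav, totchol)

# Step 3: compare the subgroup results with each other and the full sample.
print(f"subgroup A r = {r_a:.3f}, subgroup B r = {r_b:.3f}, "
      f"full sample r = {r_full:.3f}")
```

With only six cases per subgroup the two coefficients can differ considerably, which is itself informative about how far such a small sample can be trusted.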

Welge-Crow, LeCluyse and Thompson (1990) describe the
interpretation of results that the researcher should expect from
this procedure. The invariance coefficients obtained for the two
samples should approach 1.0 for the results to be indicative of
replicability across samples. If the results do indicate
replicability, "... the researcher can interpret the set of
results involving all the subjects with more confidence" (p. 8).
Welge-Crow, LeCluyse and Thompson (1990) explain that the
interpretation of results should always be based upon the total
sample, not the subgroup splits, because the "full sample should
theoretically provide the most generalizable results; sample
splitting is only performed to evaluate the replicability of the
results" (p. 8). It is also stressed that any
replicability study must use empirical methods to evaluate
replicability, not subjective comparison of solutions (Welge-Crow,
LeCluyse & Thompson, 1990). Results that appear on the surface to
be different (e.g., yield markedly different beta weights for
variables) may be remarkably similar in the population effects
that are estimated.

The "bootstrap" method devised by Efron and his colleagues
(Diaconis & Efron, 1983; Efron, 1979) is considered to be one of
the most powerful strategies for evaluating the replicability of
results. The process underlying the "bootstrap" method involves
copying the original data set numerous times on top of itself and
thus creating a "mega" data set. From this "mega" data set,
hundreds or even thousands of samples are randomly selected and
undergo the specific analyses required by the particular study.
The results are computed separately for each sample. Once
all the samples have been analyzed, their results are
averaged. The power of this type of method is realized through
the analytic consideration of "so many configurations of subjects
and informs the researcher regarding the extent to which results
generalize across different configurations of subjects" (Welge-Crow,
LeCluyse & Thompson, 1990, p. 9).

Thompson and Melancon (1990) provide examples of "bootstrap"
methods, and further explanation of the methods. Welge-Crow,
LeCluyse and Thompson (1990) also include in their paper an
example of "bootstrap" estimation. Table 2 presents the small
data set used in this example. Table 3 presents the descriptive
statistics for the data in hand (n=12) for this example; these
are the results conventionally calculated by researchers. Table 4
presents the "bootstrap" estimates of the population correlation
coefficients based on the original data. Lunneborg's (1987)
software, which automates the "bootstrap" method on a
microcomputer, was used to derive the tabled estimates based upon
500 resamplings with replacement from the small data set.

INSERT TABLES 2, 3 AND 4 ABOUT HERE. 

For the Table 4 example illustrated by Welge-Crow, LeCluyse
and Thompson (1990), it can be seen that the results of the
"bootstrap" method indicate that the first correlation
coefficient (0.052) is very close to the mean found in 500
bootstrap resamplings (0.0514). Such a result would suggest to
the researcher that some confidence can be vested in the
results in hand, since the sample result so closely approximates
the result over several hundred configurations of the subjects.
The relatively large standard deviation (0.3541) for the eighth
coefficient, however, suggests that this estimate (r=.008) is
the least stable over different groups of the subjects.
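
The resampling logic behind Table 4 can be sketched in a few lines. This is a minimal illustration, not Lunneborg's (1987) software: it assumes simple resampling of the 12 cases with replacement, analyzes only the first coefficient (MILESEC with SYSTOLAV), and uses an arbitrary seed, so its summary values will not match the tabled values exactly.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson r, or None when a resample has no variance on a variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def bootstrap_r(x, y, n_resamples=500, seed=0):
    """Correlations from resamples of cases drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_resamples:
        picks = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([x[i] for i in picks], [y[i] for i in picks])
        if r is not None:  # skip degenerate resamples
            estimates.append(r)
    return estimates


# MILESEC and SYSTOLAV columns from Table 2.
milesec = [890, 1097, 1300, 948, 940, 760, 740, 571, 748, 640, 642, 957]
systolav = [94.0, 108.7, 97.7, 90.3, 100.7, 104.3,
            95.3, 97.7, 102.7, 96.0, 107.0, 108.0]

sample_r = pearson_r(milesec, systolav)
estimates = bootstrap_r(milesec, systolav, n_resamples=500, seed=0)
summary = {
    "mean": statistics.mean(estimates),
    "median": statistics.median(estimates),
    "sd": statistics.stdev(estimates),
}

print(f"sample r = {sample_r:.3f}")
print({name: round(value, 4) for name, value in summary.items()})
```

Comparing the sample coefficient with the bootstrap mean, and inspecting the bootstrap SD, mirrors the reading of Table 4 given above.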

Increasingly, researchers note that they obtain 
statistically significant results, but careful scrutiny of the 
data demonstrates that such differences are not necessarily 
meaningful. As Holmes (1990, p. 72) observes:

     The trouble with reporting statistically
     significant results is twofold. First, all too
     often the word "statistically" gets lost or left
     off. Thus, the researcher reports a "significant
     difference was obtained ..." This leads to the
     second problem. When the word "significant" is used
     in this way, most people naturally equate it with
     the words "important," "meaningful," or
     "practical." Just the phrase, "A significant
     difference was found...." carries a certain amount
     of authority.

Three strategies for augmenting the interpretation of
significance test results were illustrated. The first strategy
deals with the size of the sample and the effects that too large
or too small a sample size may have upon significance test
results. A second strategy involves using the effect size to
determine the degree to which the identified phenomenon is found
to exist in the data. Finally, evaluating the replicability of
the results, using either cross-validation or the "bootstrap"
methods developed by Efron and his colleagues, was discussed.

Researchers spend valuable time and money conducting a
research project. Numerous hours spent on the interpretation of
the results can all be for nothing if the researcher fails in the
fundamental determination of whether there is actually a
meaningful effect size found within the data. Conducting a
t-test, ANOVA, or other statistical significance test will inform
the researcher whether there is "statistical significance", but the
researcher must delve further into the data to determine whether the
results are "meaningful". This paper described three
strategies that can help the researcher make these determinations.

Table 1

Statistical Significance at Various Sample Sizes
for a Fixed Effect Size (Large Effect Size)

             SOS      r sq    df   MS        Fcalc    Fcrit    Dec.

SOSexp       337.2    0.336    2   168.600   0.253    200.00   Not Rej
SOSunexp     665.1             1   665.100
SOStot      1002.3             3   334.100

SOSexp       337.2    0.336    2   168.600   2.535      4.10   Not Rej
SOSunexp     665.1            10    66.510
SOStot      1002.3            12    83.525

SOSexp       337.2    0.336    2   168.600   5.070      3.49   Rej
SOSunexp     665.1            20    33.255
SOStot      1002.3            22    45.559

SOSexp       337.2    0.336    2   168.600   7.605      3.32   Rej
SOSunexp     665.1            30    22.170
SOStot      1002.3            32    31.322

Note. As sample size increases, tabled "critical F" values get
smaller. Additionally, as sample size increases, error df gets
larger, mean square error gets smaller, and thus "calculated F"
also gets larger. Entries in bold remain fixed for the purposes
of these analyses. From Thompson (1989b), with permission.



Table 2

Data Set for Heuristic Example

 n   MILESEC         SYSTOLAV        POND           TOTCHOL       HDLCHOL

 1    890 (+0.18)     94.0 (-1.04)   11.5 (-0.91)   180 (+0.64)   80.1 (+1.17)
 2   1097 (+1.16)    108.7 (+1.42)   12.0 (-0.69)   142 (-1.56)   51.1 (-1.02)
 3   1300 (+2.12)     97.7 (-0.42)   13.1 (-0.21)   165 (-0.23)   63.3 (-0.10)
 4    948 (+0.45)     90.3 (-1.66)   12.6 (-0.43)   199 (+1.74)   75.7 (+0.84)
 5    940 (+0.41)    100.7 (+0.08)   19.3 (+2.49)   187 (+1.04)   61.0 (-0.27)
 6    760 (-0.44)    104.3 (+0.69)   14.7 (+0.48)   148 (-1.22)   76.0 (+0.86)
 7    740 (-0.53)     95.3 (-0.82)   14.2 (+0.26)   164 (-0.29)   78.5 (+1.05)
 8    571 (-1.33)     97.7 (-0.42)   13.6 (+0.00)   174 (+0.29)   54.3 (-0.78)
 9    748 (-0.50)    102.7 (+0.42)   10.9 (-1.17)   190 (+1.22)   62.2 (-0.18)
10    640 (-1.01)     96.0 (-0.70)   11.4 (-0.95)   161 (-0.46)   67.4 (+0.21)
11    642 (-1.00)    107.0 (+1.14)   14.6 (+0.44)   159 (-0.58)   34.8 (-2.25)
12    957 (+0.49)    108.0 (+1.30)   15.2 (+0.70)   159 (-0.58)   70.8 (+0.47)

Note. From Thompson (1990), with permission.



Table 3

Descriptive Statistics and Correlation Coefficients

           MILESEC  SYSTOLAV   POND   TOTCHOL  HDLCHOL  PREDCl  CRITCl

Mean        852.8    100.2     13.6    169.0     64.6     0.0     0.0
SD          211.0      6.0      2.3     17.3     13.2     1.0     1.0

MILESEC                .052    .046    -.047     .140    .063    .048
SYSTOLAV                       .244    -.624    -.559   -.981   -.752
POND                                    .008    -.121   -.084   -.064
TOTCHOL                                          .243    .637    .830
HDLCHOL                                                  .569    .742
PREDCl                                                           .767

Note. From Thompson (1990), with permission.



Table 4

Bootstrap Estimates of r's for Table 2 Data
Based on 500 Samples with Replacement

         Table 3     Bootstrap   Bootstrap   Bootstrap
Coef.    Estimate    Mean        Median      SD

  1       0.052       0.0514      0.0417     0.2819
  2       0.046       0.0421      0.0603     0.2287
  3      -0.047      -0.0233     -0.0489     0.2690
  4       0.140       0.1135      0.1551     0.3092
  5       0.244       0.2598      0.2428     0.2343
  6      -0.624      -0.5878     -0.6196     0.2135
  7      -0.559      -0.5430     -0.5737     0.2166
  8       0.008      -0.0649     -0.0519     0.3541
  9      -0.121      -0.0971     -0.1198     0.2224
 10       0.243       0.2189      0.2486     0.2560

Note. From Thompson (1990), with permission.




